The Combinatorial BLAS: design, implementation, and applications
نویسندگان
چکیده
This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse grained parallelism for thousands of processors, which can be uncovered by using the right primitives. We describe the parallel Combinatorial BLAS, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications. We provide an extendible library interface and some guiding principles for future development. The library is evaluated using two important graph algorithms, in terms of both performance and ease-ofuse. The scalability and raw performance of the example applications, using the combinatorial BLAS, are unprecedented on distributed memory clusters.
منابع مشابه
FlexiBLAS - A flexible BLAS library with runtime exchangeable backends
The BLAS library is one of the central libraries for the implementation of numerical algorithms. It serves as the basis for many other numerical libraries like LAPACK, PLASMA or MAGMA (to mention only the most obvious). Thus a fast BLAS implementation is the key ingredient for efficient applications in this area. However, for debugging or benchmarking purposes it is often necessary to replace t...
متن کاملBLAS on the Trident Processor: Implementation and Performance Evaluation
This paper describes the implementation of the Basic Linear Algebra Subprograms (BLAS), which are widely used in many applications, on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform vector-vector, matrix-vector, and matrix-matrix operations needed for implementing BLAS. The TFLOPS rate on infinite-size pro...
متن کاملA High Performance FPGA-Based Accelerator for BLAS Library Implementation
This paper describes the implementation and the performance analysis of a hardware accelerator for the BLAS library matrix multiplication operation. This accelerator is based on a dual-FPGA board and on an implementation BLAS software library making use of the FPGA-based hardware. In order to evaluate the performance of such a system, we implemented the matrix multiplication operation (BLAS “dg...
متن کاملImplementation of the Blas Level 3 and Linpack Benchmark on the Ap1000
This paper describes an implementation of Level 3 of the Basic Linear Algebra Subprogram (BLAS-3) library and the Linpack Benchmark on the Fujitsu AP1000. The performance of these applications is regarded as important for distributed memory architectures such as the AP1000. We discuss the techniques involved in optimizing these applications without significantly sacrificing numerical stability....
متن کاملA Simulated Annealing Algorithm for Unsplittable Capacitated Network Design
The Network Design Problem (NDP) is one of the important problems in combinatorial optimization. Among the network design problems, the Multicommodity Capacitated Network Design (MCND) problem has numerous applications in transportation, logistics, telecommunication, and production systems. The MCND problems with splittable flow variables are NP-hard, which means they require exponential time t...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- IJHPCA
دوره 25 شماره
صفحات -
تاریخ انتشار 2011